\section{Parallelization layer}
\label{sec:parallelization}
Mitsuba is built on top of a flexible parallelization layer, which spreads out
various types of computation over local and remote cores.
The guiding principle is that if an operation can potentially take longer than a
few seconds, it ought to use all the cores it can get.

Here, we will go through a basic example, which will hopefully provide sufficient intuition
to realize more complex tasks.
To obtain good (i.e. close to linear) speedups, the parallelization layer depends on
several key assumptions of the task to be parallelized:
\begin{itemize}
\item The task can easily be split up into a discrete number of \emph{work units}, which requires a negligible amount of computation.
\item Each work unit is small in footprint so that it can easily be transferred over the network or shared memory.
\item A work unit constitutes a significant amount of computation, which by far outweighs the cost of transmitting it to another node.
\item The \emph{work result} obtained by processing a work unit is again small in footprint, so that it can easily be transferred back.
\item Merging all work results to a solution of the whole problem requires a negligible amount of additional computation.
\end{itemize}
This essentially corresponds to a parallel version of \emph{Map} (one part of \emph{Map\&Reduce}) and is
ideally suited for most rendering workloads.

The example we consider here computes a \code{ROT13} ``encryption'' of a string, which
most certainly violates the ``significant amount of computation'' assumption.
It was chosen due to the inherent parallelism and simplicity of this task.
While of course over-engineered to the extreme, the example hopefully
communicates how this framework might be used in more complex scenarios.

We will implement this program as a plugin for the utility launcher \code{mtsutil}, which
frees us from having to write lots of code to set up the framework, prepare the
scheduler, etc.

We start by creating the utility skeleton file \code{src/utils/rot13.cpp}:
\begin{cpp}
#include <mitsuba/render/util.h>

MTS_NAMESPACE_BEGIN

class ROT13Encoder : public Utility {
public:
    int run(int argc, char **argv) {
        cout << "Hello world!" << endl;
        return 0;
    }

    MTS_DECLARE_UTILITY()
};

MTS_EXPORT_UTILITY(ROT13Encoder, "Perform a ROT13 encryption of a string")
MTS_NAMESPACE_END
\end{cpp}
The file must also be added to the build system: insert the line
\begin{shell}
plugins += $\texttt{env}$.SharedLibrary('rot13', ['rot13.cpp'])
\end{shell}
into the \code{utils/SConscript} file. After compiling
using \code{scons}, the \code{mtsutil} binary should automatically pick up your new utility plugin:
\begin{shell}
$\texttt{\$}$ mtsutil
..
The following utilities are available:

    addimages             Generate linear combinations of EXR images
    rot13                 Perform a ROT13 encryption of a string
\end{shell}
It can be executed as follows:
\begin{shell}
$\texttt{\$}$ mtsutil rot13
2010-08-16 18:38:27 INFO  main [src/mitsuba/mtsutil.cpp:276] Mitsuba version 0.1.1, Copyright (c) 2010 Wenzel Jakob
2010-08-16 18:38:27 INFO  main [src/mitsuba/mtsutil.cpp:350] Loading utility "rot13" ..
Hello world!
\end{shell}

Our approach for implementing distributed ROT13 will be to treat each character as an
indpendent work unit. Since the ordering is lost when sending out work units, we must
also include the position of the character in both the work units and the work results.

All of the relevant interfaces are contained in \code{include/mitsuba/core/sched.h}.
For reference, here are the interfaces of \code{WorkUnit} and \code{WorkResult}:
\begin{cpp}
/**
 * Abstract work unit. Represents a small amount of information
 * that encodes part of a larger processing task.
 */
class MTS_EXPORT_CORE WorkUnit : public Object {
public:
    /// Copy the content of another work unit of the same type
    virtual void set(const WorkUnit *workUnit) = 0;

    /// Fill the work unit with content acquired from a binary data stream
    virtual void load(Stream *stream) = 0;

    /// Serialize a work unit to a binary data stream
    virtual void save(Stream *stream) const = 0;

    /// Return a string representation
    virtual std::string toString() const = 0;

    MTS_DECLARE_CLASS()
protected:
    /// Virtual destructor
    virtual ~WorkUnit() { }
};
/**
 * Abstract work result. Represents the information that encodes
 * the result of a processed <tt>WorkUnit</tt> instance.
 */
class MTS_EXPORT_CORE WorkResult : public Object {
public:
    /// Fill the work result with content acquired from a binary data stream
    virtual void load(Stream *stream) = 0;

    /// Serialize a work result to a binary data stream
    virtual void save(Stream *stream) const = 0;

    /// Return a string representation
    virtual std::string toString() const = 0;

    MTS_DECLARE_CLASS()
protected:
    /// Virtual destructor
    virtual ~WorkResult() { }
};
\end{cpp}
In our case, the \code{WorkUnit} implementation then looks like this:
\begin{cpp}
class ROT13WorkUnit : public WorkUnit {
public:
    void set(const WorkUnit *workUnit) {
        const ROT13WorkUnit *wu =
            static_cast<const ROT13WorkUnit *>(workUnit);
        m_char = wu->m_char;
        m_pos = wu->m_pos;
    }

    void load(Stream *stream) {
        m_char = stream->readChar();
        m_pos = stream->readInt();
    }

    void save(Stream *stream) const {
        stream->writeChar(m_char);
        stream->writeInt(m_pos);
    }

    std::string toString() const {
        std::ostringstream oss;
        oss << "ROT13WorkUnit[" << endl
            << "  char = '" << m_char << "'," << endl
            << "  pos = " << m_pos << endl
            << "]";
        return oss.str();
    }

    inline char getChar() const { return m_char; }
    inline void setChar(char value) { m_char = value; }
    inline int getPos() const { return m_pos; }
    inline void setPos(int value) { m_pos = value; }

    MTS_DECLARE_CLASS()
private:
    char m_char;
    int m_pos;
};

MTS_IMPLEMENT_CLASS(ROT13WorkUnit, false, WorkUnit)
\end{cpp}
The \code{ROT13WorkResult} implementation is not reproduced since it is almost identical
(except that it doesn't need the \code{set} method).
The similarity is not true in general: for most algorithms, the work unit and result
will look completely different.

Next, we need a class, which does the actual work of turning a work unit into a work result
(a subclass of \code{WorkProcessor}). Again, we need to implement a range of support
methods to enable the various ways in which work processor instances will be submitted to
remote worker nodes and replicated amongst local threads.
\begin{cpp}
class ROT13WorkProcessor : public WorkProcessor {
public:
    /// Construct a new work processor
    ROT13WorkProcessor() : WorkProcessor() { }

    /// Unserialize from a binary data stream (nothing to do in our case)
    ROT13WorkProcessor(Stream *stream, InstanceManager *manager)
        : WorkProcessor(stream, manager) { }

    /// Serialize to a binary data stream (nothing to do in our case)
    void serialize(Stream *stream, InstanceManager *manager) const {
    }

    ref<WorkUnit> createWorkUnit() const {
        return new ROT13WorkUnit();
    }

    ref<WorkResult> createWorkResult() const {
        return new ROT13WorkResult();
    }

    ref<WorkProcessor> clone() const {
        return new ROT13WorkProcessor(); // No state to clone in our case
    }

    /// No internal state, thus no preparation is necessary
    void prepare() { }

    /// Do the actual computation
    void process(const WorkUnit *workUnit, WorkResult *workResult,
                 const bool &stop) {
        const ROT13WorkUnit *wu
            = static_cast<const ROT13WorkUnit *>(workUnit);
        ROT13WorkResult *wr = static_cast<ROT13WorkResult *>(workResult);
        wr->setPos(wu->getPos());
        wr->setChar((std::toupper(wu->getChar()) - 'A' + 13) % 26 + 'A');
    }
    MTS_DECLARE_CLASS()
};
MTS_IMPLEMENT_CLASS_S(ROT13WorkProcessor, false, WorkProcessor)
\end{cpp}
Since our work processor has no state, most of the implementations
are rather trivial. Note the \code{stop} field in the \code{process}
method. This field is used to abort running jobs at the users requests, hence
it is a good idea to periodically check its value during lengthy computations.

Finally, we need a so-called \emph{parallel process}
instance, which is responsible for creating work units and stitching
work results back into a solution of the whole problem. The \code{ROT13}
implementation might look as follows:
\begin{cpp}
class ROT13Process : public ParallelProcess {
public:
    ROT13Process(const std::string &input) : m_input(input), m_pos(0) {
        m_output.resize(m_input.length());
    }

    ref<WorkProcessor> createWorkProcessor() const {
        return new ROT13WorkProcessor();
    }

    std::vector<std::string> getRequiredPlugins() {
        std::vector<std::string> result;
        result.push_back("rot13");
        return result;
    }

    EStatus generateWork(WorkUnit *unit, int worker /* unused */) {
        if (m_pos >= (int) m_input.length())
            return EFailure;
        ROT13WorkUnit *wu = static_cast<ROT13WorkUnit *>(unit);

        wu->setPos(m_pos);
        wu->setChar(m_input[m_pos++]);

        return ESuccess;
    }

    void processResult(const WorkResult *result, bool cancelled) {
        if (cancelled) // indicates a work unit, which was
            return;    // cancelled partly through its execution
        const ROT13WorkResult *wr =
            static_cast<const ROT13WorkResult *>(result);
        m_output[wr->getPos()] = wr->getChar();
    }

    inline const std::string &getOutput() {
        return m_output;
    }

    MTS_DECLARE_CLASS()
public:
    std::string m_input;
    std::string m_output;
    int m_pos;
};
MTS_IMPLEMENT_CLASS(ROT13Process, false, ParallelProcess)
\end{cpp}
The \code{generateWork} method produces work units until we have moved past
the end of the string, after which it returns the status code \code{EFailure}.
Note the method \code{getRequiredPlugins()}: this is necessary to use
the utility across
machines. When communicating with another node, it ensures that the remote side
loads the \code{ROT13*} classes at the right moment.

To actually use the \code{ROT13} encoder, we must first launch the newly created parallel process
from the main utility function (the `Hello World' code we wrote earlier). We can adapt it as follows:
\begin{cpp}
    int run(int argc, char **argv) {
        if (argc < 2) {
            cout << "Syntax: mtsutil rot13 <text>" << endl;
            return -1;
        }

        ref<ROT13Process> proc = new ROT13Process(argv[1]);
        ref<Scheduler> sched = Scheduler::getInstance();

        /* Submit the encryption job to the scheduler */
        sched->schedule(proc);

        /* Wait for its completion */
        sched->wait(proc);

        cout << "Result: " << proc->getOutput() << endl;

        return 0;
    }
\end{cpp}
After compiling everything using \code{scons}, a simple example
involving the utility would be to encode a string (e.g. \code{SECUREBYDESIGN}), while
forwarding all computation to a network machine. (\code{-p0} disables
all local worker threads). Adding a verbose flag (\code{-v}) shows
some additional scheduling information:
\begin{shell}
$\texttt{\$}$ mtsutil -vc feynman -p0 rot13 SECUREBYDESIGN
2010-08-17 01:35:46 INFO  main [src/mitsuba/mtsutil.cpp:201] Mitsuba version 0.1.1, Copyright (c) 2010 Wenzel Jakob
2010-08-17 01:35:46 INFO  main [SocketStream] Connecting to "feynman:7554"
2010-08-17 01:35:46 DEBUG main [Thread] Spawning thread "net0_r"
2010-08-17 01:35:46 DEBUG main [RemoteWorker] Connection to "feynman" established (2 cores).
2010-08-17 01:35:46 DEBUG main [Scheduler] Starting ..
2010-08-17 01:35:46 DEBUG main [Thread] Spawning thread "net0"
2010-08-17 01:35:46 INFO  main [src/mitsuba/mtsutil.cpp:275] Loading utility "rot13" ..
2010-08-17 01:35:46 DEBUG main [Scheduler] Scheduling process 0: ROT13Process[unknown]..
2010-08-17 01:35:46 DEBUG main [Scheduler] Waiting $\texttt{for}$ process 0
2010-08-17 01:35:46 DEBUG net0 [Scheduler] Process 0 has finished generating work
2010-08-17 01:35:46 DEBUG net0_r[Scheduler] Process 0 is $\texttt{complete}$.
Result: FRPHEROLQRFVTA
2010-08-17 01:35:46 DEBUG main [Scheduler] Pausing ..
2010-08-17 01:35:46 DEBUG net0 [Thread] Thread "net0" has finished
2010-08-17 01:35:46 DEBUG main [Scheduler] Stopping ..
2010-08-17 01:35:46 DEBUG main [RemoteWorker] Shutting down
2010-08-17 01:35:46 DEBUG net0_r[Thread] Thread "net0_r" has finished
\end{shell}
